
Published February 3, 2025 | Version v1
Dataset (Open Access)

Correcting Transiting Lightcurves of Exoplanets From Non-linear Astrophysical and Instrumental Noise

  • 1. Cardiff University
  • 2. University of Groningen
  • 3. University College London
  • 4. SRON Netherlands Institute for Space Research

Description

Purpose

This dataset, restructured and curated from the original competition data of the 2019 and 2021 editions of the Ariel Data Challenge, is kindly provided by Dr Luís F. Simões from ML Analytics. It is designed to advance the development of machine learning techniques that detrend or denoise non-linear noise in astronomical observations, including, but not limited to, exoplanet observations.

Introduction

Exoplanets are planets orbiting stars other than our Sun, just as the planets of our solar system orbit it. To date, we know of over 5,500 exoplanets in more than 3,400 planetary systems (this number changes daily, so have a look at the NASA Exoplanet Archive webpage for an up-to-date count).

When analysing these distant worlds, disentangling the effects of stellar activity and the non-linear noise of the instrument is one of the major data analysis challenges in the field, and it directly impacts our scientific measurements. Without correcting for brightness variability of the star and sensitivity variations of the instrument, we cannot measure the radius of the planet correctly or, perhaps more importantly, the chemistry of its atmosphere.

The Ariel mission is a European Space Agency (ESA) medium-class mission (~500M euros) to be launched to the second Lagrange point in 2029. The goal of the mission is to study the atmospheres and chemistry of 1,000 extrasolar planets (aka exoplanets) in our local galactic neighbourhood. By understanding their atmospheres, we can infer how these planets formed and what their natures are like, and ultimately put our own solar system into context. For more information on Ariel, see the mission website.

The dataset contains simulated observations of exoplanets transiting their respective host stars. These lightcurves are corrupted by astrophysical noise (such as stellar spots) and instrumental noise (such as photon noise, persistent noise and 1/f noise).

 

Drifts in the data

Data drift undermines the performance of a model at test time and in production. Ariel presents a unique challenge in that the simulated data are likely not a good representation of the instrument's actual performance in space. To simulate this challenge, the curated dataset combines data from the 2019 and 2021 editions, both targeting the same problem (as detailed in the problem statement) but generated by two different data generation pipelines. The two datasets are separated into folders and are aligned in the sense that the same set of planets (and their respective planetary systems) is used to simulate the observations, but with different simulation pipelines.

 

Problem Statement:

Given noisy lightcurves (corrupted by different physical processes) and auxiliary information about the planets and their respective planetary systems, devise solutions that convert these noisy observations (referred to as lightcurves) into transmission spectra. The task is conventionally advertised as a machine learning exercise; however, users are free to use other methods as appropriate. This task is a major step in the processing of astrophysical observations, so that the resultant spectrum can eventually be used to interpret the atmospheres of exoplanets.

 

Reading the Data

Both files are stored in HDF5 format and can be loaded using the following Python code:

import h5py

# Open the file read-only, list its top-level keys, then close it
adc19_data = h5py.File('adc19_core.h5', 'r')
print(adc19_data.keys())
adc19_data.close()

This format is best for reading the spectroscopic lightcurve inputs ('X') and the target transmission spectra ('y').
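The nested 'X'/'y' layout can be exercised without the real files by writing a tiny synthetic HDF5 file with the same shape and reading it back. The file path, key and values below are placeholders, not part of the dataset:

```python
import os
import tempfile

import h5py
import numpy as np

# Placeholder path; the real files are adc19_core.h5 / adc21_core.h5
path = os.path.join(tempfile.mkdtemp(), 'toy_core.h5')

# Mimic the described layout: 'X/<obs>' is a (55, 300) lightcurve,
# 'y/<obs>' a (55,) transmission spectrum (toy values only)
with h5py.File(path, 'w') as f:
    f.create_dataset('X/0001_01_01', data=np.ones((55, 300)))
    f.create_dataset('y/0001_01_01', data=np.full(55, 0.01))

with h5py.File(path, 'r') as f:
    lightcurve = f['X/0001_01_01'][()]  # rows: wavelengths, cols: time steps
    spectrum = f['y/0001_01_01'][()]    # one radius ratio per channel

print(lightcurve.shape, spectrum.shape)
```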

To read the auxiliary parameters, i.e. the tables describing the planets and their planetary systems, we recommend using pandas to extract them, e.g.:

import pandas as pd

# Open the HDF5 store read-only and pull out a parameter table
h5_data = pd.HDFStore('adc21_core.h5', 'r')
df = h5_data['y_params']
h5_data.close()

or, equivalently:

df = pd.read_hdf('adc19_core.h5', 'X_params')

Data Structure

Each hdf5 file contains a nested file structure with the following keys:

1. obs_to_fname:

   - Description: A pandas readable table containing the mapping from observation filenames (e.g., `AAAA_BB_CC.txt`) to tuples `(A, B, C)`, where:

     - `AAAA`: Planet index (0001 to 2097).

     - `BB`: Stellar spot noise instance (01 to 10).

     - `CC`: Gaussian photon noise instance (01 to 10).

 

2. planet:

   - Description: A pandas readable table containing information about the observed planetary systems. This data is repeated for each simulation instance.

   - Shape: (N_planets, ...).

 

3. X_params

   - Description: A pandas readable table containing auxiliary information about the observations, including stellar and planetary parameters.

   - Shape: (N_observations, 9).

   - Columns: `planet`, `stellar_spot`, `photon`, `star_temp`, `star_logg`, `star_rad`, `star_mass`, `star_k_mag`, `period`.

 

4. y_params:

   - Description: A pandas readable table containing auxiliary information about the targets, including their stellar spot and photon noise instance indices and the optional parameters `sma` and `incl`.

   - Shape: (N_observations, 5).

   - Columns: `planet`, `stellar_spot`, `photon`, `sma` (semi-major axis), `incl` (inclination).

 

5. X:

   - Description: Noisy observations of light curves. This is a nested dictionary where each key corresponds to an observation filename (e.g., `0001_01_01.txt`), and the value is a 2D array of relative fluxes.

   - Shape: (55 wavelengths x 300 time steps).

 

6. y:

   - Description: Target values for the regression problem. Another nested dictionary structure, where each key contains a 1D array of relative radii (planet-to-star radius ratios), one per wavelength channel.

   - Shape: (55 wavelengths,).
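The filename convention behind `obs_to_fname` can also be decoded directly from the observation name; a minimal sketch (the helper name is ours, not part of the dataset API):

```python
def parse_obs_fname(fname):
    """Split an observation filename like '0001_01_01.txt' into
    (planet index, stellar spot instance, photon noise instance)."""
    stem = fname.rsplit('.', 1)[0]          # drop the '.txt' extension
    planet, spot, photon = stem.split('_')  # 'AAAA', 'BB', 'CC'
    return int(planet), int(spot), int(photon)

print(parse_obs_fname('0001_01_01.txt'))  # (1, 1, 1)
```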

 

Data Format

Observations (`X`)

- Each observation is a 2D array of relative fluxes, organized as follows:

  - Rows: 55 wavelength channels (`w1` to `w55`).

  - Columns: 300 time steps (`t1` to `t300`).

- Example structure (these numbers are only representative and do not form part of the dataset):

  t1 t2 ... t300
w1 1.00010151742 1.00010151742 ... 1.00010151742
w2 0.999857792623 0.999857792623 ... 0.999857792623
... ... ... ... ...
w55 0.999468565171 0.999468565171 ... 0.999468565171
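Since each target in `y` is a planet-to-star radius ratio per channel, the transit depth in each row of `X` is roughly (Rp/Rs)^2. A toy, noise-free sketch of that relationship (all numbers invented; the dataset's real lightcurves require proper detrending, which is the point of the challenge):

```python
import numpy as np

# Synthetic lightcurve: 55 channels x 300 time steps, flat at 1.0 with a
# box-shaped transit dip of depth (Rp/Rs)^2 in the middle (toy numbers)
true_ratio = 0.1                     # Rp/Rs, identical in every channel
lc = np.ones((55, 300))
lc[:, 100:200] -= true_ratio ** 2    # in-transit flux drop

# Naive per-channel estimate: depth = out-of-transit median - in-transit minimum
depth = np.median(lc[:, :50], axis=1) - lc.min(axis=1)
ratio_est = np.sqrt(depth)           # recover Rp/Rs per wavelength channel
print(ratio_est.shape)               # one estimate per channel: (55,)
```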

 

Targets (`y`)

- Each target is a 1D array of relative radii for 55 wavelength channels. These numbers are only representative and do not form part of the dataset. 

  • Example structure:
    w1 w2 ... w55
    1.00010151742 1.00010151742 ... 1.00010151742

     

Auxiliary Parameters (`X_params` and `y_params`)

- `X_params` contains stellar and planetary parameters for each observation.

- `y_params` contains optional parameters (`sma` and `incl`) that can be used as intermediate targets or ignored. The other columns contain the stellar spot and photon noise instance indices.
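Because `X_params` and `y_params` both carry the `(planet, stellar_spot, photon)` triple, the two tables can be joined per observation with pandas. A sketch on made-up rows (column names follow the description above; the values are invented):

```python
import pandas as pd

# Toy stand-ins for X_params and y_params; each observation is identified
# by the (planet, stellar_spot, photon) triple
X_params = pd.DataFrame({'planet': [1, 1], 'stellar_spot': [1, 2],
                         'photon': [1, 1], 'star_temp': [5700.0, 5700.0]})
y_params = pd.DataFrame({'planet': [1, 1], 'stellar_spot': [1, 2],
                         'photon': [1, 1], 'sma': [0.05, 0.05],
                         'incl': [89.5, 89.5]})

# Join the optional target parameters onto the observation parameters
merged = X_params.merge(y_params, on=['planet', 'stellar_spot', 'photon'])
print(merged.columns.tolist())
```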

 

Files

Files (11.4 GB total): two HDF5 files, 5.7 GB each.

- md5:9b6a055f0bd394e1def95fda3432632d (5.7 GB)
- md5:0d9ed358b93ae78189c6799e72782a79 (5.7 GB)

Additional details

Additional titles

Subtitle
Curated Dataset from Ariel Data Challenge 2019 and 2021

Funding

European Commission
ExoAI - Deciphering super-Earths using Artificial Intelligence 758892

Software

Programming language
Python
Development Status
Active